---
title: "Project 3 - Data Science Skills"
author: "David Apolinar, Anthony Muñoz, Christina Valore, Omar Pineda."
date: "3/24/2019"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    source_code: embed
---
Row {.tabset .tabset-fade}
-------------------------------------
### Overview
### Data Load
Information on several top data scientists was stored in a CSV file and tidied using the tidyr and dplyr libraries.
```{r data load}
library(tidyr)
library(wordcloud)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(magrittr)
library(DBI)
library(dplyr)
# Data Files
ds <- read.csv("https://raw.githubusercontent.com/omarp120/DATA607Week8/master/DataScientists.csv", header=TRUE, stringsAsFactors = FALSE)
tidyDS <- gather(ds, "Number", "Skill", Skill1:Skill50) #makes data tall
finalDS <- tidyDS[tidyDS$Skill != "",] #removes rows with empty skill values
counts <- as.data.frame(table(finalDS$Skill)) #creates a data frame with skill frequencies
```
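As a toy illustration of the reshaping step above (a hypothetical two-person table; the real data has columns `Skill1` through `Skill50`), `gather` turns one column per skill slot into one row per person-skill pair:

```r
library(tidyr)
# Hypothetical wide table: one row per person, one column per skill slot
toy <- data.frame(ID = 1:2,
                  Skill1 = c("R", "Python"),
                  Skill2 = c("SQL", ""),
                  stringsAsFactors = FALSE)
tall <- gather(toy, "Number", "Skill", Skill1:Skill2) # wide -> tall
tall <- tall[tall$Skill != "", ]                      # drop empty skill slots, as above
```

`pivot_longer` is the newer tidyr equivalent of `gather` and would work the same way here.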
### SQL Creation and Insertion
DBI was used to create tables on the fly in an Azure Database for MySQL instance. The tables were created from R and all insert statements were run from R. The following tables were created:

- Skills Table
- Person Table

The two tables were linked through a `person_skills` table, giving a many-to-many relationship.
```{r SQL dataload}
# Create Skill Table
skilltable <- unique(finalDS$Skill)
skilltable <- as.data.frame(skilltable, stringsAsFactors = FALSE)
skillids <- 1:nrow(skilltable)
skilltable <- cbind.data.frame(skilltable,skillids)
names(skilltable) <- c("SkillName", "SkillID")
# Run SQL statements to create tables
con <- dbConnect(RMariaDB::MariaDB(), user='x-admin@cunyspsmysql.mysql.database.azure.com', password="7dWa0XUVHtlCJMS", dbname='datascientists' ,host='cunyspsmysql.mysql.database.azure.com')
rs<- dbSendStatement(con, "drop table if exists person_skills;")
dbClearResult(rs)
rs<-dbSendStatement(con, "drop table if exists person;")
dbClearResult(rs)
rs<-dbSendStatement(con, "drop table if exists skills;")
dbClearResult(rs)
rs <- dbSendStatement(con, "CREATE TABLE person (
personid int NOT NULL auto_increment primary key,
title nchar(50),
name nchar(50) NOT NULL,
education nchar(50),
degree nchar(50),
location nchar(50),
company nchar(50));")
dbClearResult(rs)
rs<- dbSendStatement(con, "CREATE TABLE skills (
skillid int NOT NULL auto_increment primary key,
skillname nchar(50) NOT NULL);")
dbClearResult(rs)
rs<- dbSendStatement(con, "CREATE TABLE person_skills (
personid int NOT NULL references person(personid),
skillid int NOT NULL references skills(skillid),
CONSTRAINT person_skill primary key(personid, skillid));")
dbClearResult(rs)
dbDisconnect(con)
# Create SQL Connection
con <- dbConnect(RMariaDB::MariaDB(), user='x-admin@cunyspsmysql.mysql.database.azure.com', password="7dWa0XUVHtlCJMS", dbname='datascientists' ,host='cunyspsmysql.mysql.database.azure.com')
#mysql_datascientists <- dbGetQuery(con, 'select * from skills')
for(i in 1:nrow(skilltable))
{
# print(paste0("Inserting Skill: ", skilltable[i,]$SkillName, ", SkillID: ", skilltable[i,]$SkillID) )
sql <- sprintf("insert into skills
(skillname, skillid)
values ('%s', %d);",
skilltable[i,]$SkillName, skilltable[i,]$SkillID)
rs <- dbSendStatement(con, sql)
dbClearResult(rs)
}
mysql_dataskills <- dbGetQuery(con, 'select * from skills
limit 10')
mysql_dataskills
dbDisconnect(con)
```
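The `sprintf`-built INSERTs above will misfire if a value contains a quote character. A minimal sketch of DBI's parameterized alternative, using an in-memory RSQLite database so it is self-contained (the project itself targets Azure MySQL, but the DBI calls are identical):

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE skills (
  skillid INTEGER PRIMARY KEY,
  skillname TEXT NOT NULL);")
# Hypothetical skill table; note the apostrophe that would break sprintf quoting
skilltable <- data.frame(SkillName = c("R", "O'Reilly SQL"),
                         SkillID = 1:2, stringsAsFactors = FALSE)
# One parameterized statement; DBI binds each row and handles all escaping
dbExecute(con, "INSERT INTO skills (skillname, skillid) VALUES (?, ?);",
          params = list(skilltable$SkillName, skilltable$SkillID))
res <- dbGetQuery(con, "SELECT skillname FROM skills ORDER BY skillid;")
dbDisconnect(con)
```

The same `params =` interface works with `RMariaDB::MariaDB()` connections, so the loop bodies above could be replaced by single calls.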
Below is the list of people pulled back from the database; we limit the results to only 10 rows.
```{r insert people}
# Get Unique People to Insert
con <- dbConnect(RMariaDB::MariaDB(), user='x-admin@cunyspsmysql.mysql.database.azure.com', password="7dWa0XUVHtlCJMS", dbname='datascientists' ,host='cunyspsmysql.mysql.database.azure.com')
people_table <- finalDS %>% select(ID,Person, Title, School, HighestLevel, Location, Company) %>% unique()
for(i in 1:nrow(people_table))
{
# print(paste0("Inserting Person: ",
# people_table[i,]$Person, ", Title: ",
# people_table[i,]$Title, "School: ",
# people_table[i,]$School, ", Degree: ",
# people_table[i,]$HighestLevel, ", Location: ",
# people_table[i,]$Location, ", Company: ",
# people_table[i,]$Company))
sql <- sprintf("insert into person
(name, title, education, degree, location, company)
values ('%s', '%s', '%s','%s', '%s', '%s');",
people_table[i,]$Person,
people_table[i,]$Title,
people_table[i,]$School,
people_table[i,]$HighestLevel,
people_table[i,]$Location,
people_table[i,]$Company)
rs <- dbSendStatement(con, sql)
dbClearResult(rs)
}
mysql_datascientists <- dbGetQuery(con, 'select name,education,degree,title,company,location from person
limit 10;')
mysql_datascientists
dbDisconnect(con)
# Create Many to Many Relationship
linkdb<- tidyDS %>% select(ID, Skill)
# Look up the SkillID for one person_skill row; the original loop always
# returned on its first iteration, so a direct vector lookup is equivalent
returnIndex <- function(n)
{
  return (skilltable$SkillID[skilltable$SkillName == n$Skill])
}
# Remove duplicate rows
person_skill <- finalDS %>% select(ID, Person, Skill) %>% distinct()
#returnIndex(linkdb[478,])
# Create Link Table
con <- dbConnect(RMariaDB::MariaDB(), user='x-admin@cunyspsmysql.mysql.database.azure.com', password="7dWa0XUVHtlCJMS", dbname='datascientists' ,host='cunyspsmysql.mysql.database.azure.com')
for(i in 1:nrow(person_skill))
{
if(length(returnIndex(person_skill[i,])) != 0)
{
# print(paste0("Inserting (PersonID: ", person_skill[i,]$ID, " SkillID: ", returnIndex(person_skill[i,]),")") )
sql <- sprintf("insert into person_skills
(personid, skillid)
values (%d, %d);",
person_skill[i,]$ID, returnIndex(person_skill[i,]))
rs <- dbSendStatement(con, sql)
dbClearResult(rs)
} else
{
print("Empty Skill Value, skipping link")
}
}
dbDisconnect(con)
```
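Once the link table is populated, the many-to-many relationship is resolved by joining through it. A small dplyr sketch with toy versions of the three tables (hypothetical names and IDs):

```r
library(dplyr)
# Toy stand-ins for the person, skills, and person_skills tables
person <- data.frame(personid = 1:2, name = c("Ada", "Grace"))
skills <- data.frame(skillid = 1:2, skillname = c("R", "Statistics"))
person_skills <- data.frame(personid = c(1, 1, 2), skillid = c(1, 2, 2))
# Walk the link table out to both sides of the relationship
resolved <- person_skills %>%
  inner_join(person, by = "personid") %>%
  inner_join(skills, by = "skillid")
```

The SQL equivalent joins `person_skills` to `person` on `personid` and to `skills` on `skillid`.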
### Web Scraping Information
### Word Cloud
```{r data vis}
colnames(counts) <- c("Skill", "Freq")
wordcloud(counts$Skill, counts$Freq, random.order = FALSE, scale = c(2, 0.10), colors=brewer.pal(8, "Dark2"))
```
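The word cloud sizes each term by its frequency in `counts`. The same data frame can be ranked directly; a small sketch with a stand-in frequency table (the real `counts` is built in the Data Load chunk):

```r
# Stand-in for the counts data frame from the Data Load chunk
counts_demo <- data.frame(Skill = c("R", "Python", "Statistics"),
                          Freq = c(12, 9, 15), stringsAsFactors = FALSE)
# Most frequent skills first, mirroring how wordcloud scales terms
top_skills <- counts_demo[order(-counts_demo$Freq), ]
```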
### Data Scientists Common Locations
```{r geocode}
#code adapted from http://www.storybench.org/geocode-csv-addresses-r/
#library(ggmap)
#register_google(key = "xxx") #removed personal API key
# Initialize the data frame
#getOption("ggmap")
# Loop through the addresses to get the latitude and longitude of each address and add it to the
# origAddress data frame in new columns lat and lon
#for(i in 1:nrow(ds))
#{
# print("Working...")
# result <- geocode(ds$Location[i], output = "latlon", source = "google")
# ds$lon[i] <- as.numeric(result[1])
# ds$lat[i] <- as.numeric(result[2])
#}
# Write a CSV file containing origAddress to the working directory
#write.csv(ds, "geocoded.csv", row.names=FALSE)
```
```{r map}
library(leaflet)
cities <- read.csv("https://raw.githubusercontent.com/omarp120/DATA607Week8/master/geocoded.csv")
cities %>%
leaflet() %>%
addTiles() %>%
addMarkers(clusterOptions = markerClusterOptions())
```
### Conclusion